Experimental Design for Highthroughput experiments

Susan Holmes, Wolfgang Huber

2017-07-13

Design of High Throughput Experiments

RA Fisher Father of Experimental Design
To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.:
(Fisher 1935) (Presidential Address to the First Indian Statistical Congress, 1938. Sankhya 4, 14-17).

Goals for this Lecture

The Art of “Good Enough”

Types of studies / experiments

Experiment

Retrospective observational studies

Prospective, controlled studies

Meta-analysis

Illustration: experiment

Well-characterized cell line growing in laboratory conditions on defined media, temperature and atmosphere.

We administer a precise amount of a drug, and after 72h we measure the activity of a specific pathway reporter.

Illustration: challenges with studies

We recruited 200 patients that have a disease, fulfill inclusion criteria (e.g. age, comorbidities, mental capacity) and ask them to take a drug each day exactly at 6 am. After 3 months, we take an MRI scan and lots of other biomarkers to see whether and how the disease has changed or whether there were any other side effects.

What to do about this?

Variability: noise and bias

Statisticians use the term error differently from its use in everyday language: deviation of a measured value from the ideal, true value.

Two sources of error:

Examples

What is a good normalization method?

library("DESeq2")
library("airway")
library("ggplot2")
library("dplyr")
library("gridExtra")
data("airway")
aw = DESeqDataSet(airway, design = ~ cell + dex) %>% estimateSizeFactors
sizeFactors(aw)

samples = c("SRR1039513", "SRR1039517") 

myScatterplot = function(x) {
  as_tibble(x) %>% 
  mutate(rs = rowSums(x)) %>%
  filter(rs >= 2) %>%
  ggplot(aes(x = log2(SRR1039513) + 2, 
             y = log2(SRR1039517) + 2)) + geom_hex(bins = 50) +
    coord_fixed() + 
    geom_abline(slope = 1, intercept = 0, col = "orange") + 
    theme(legend.position = "none")
}

grid.arrange(
  myScatterplot(counts(aw)),
  myScatterplot(counts(aw, normalized = TRUE)),
  ncol = 2)

What do we want from a good normalization method:

Possible figure of merit?

Occam’s razor

William of Ockham William of Ockham

If one can explain a phenomenon without assuming this or that hypothetical entity, there is no ground for assuming it.

One should always opt for an explanation in terms of the fewest possible causes, factors, or variables.

Error models

Noise is in the eye of the beholder

The efficiency of most biochemical or physical processes involving DNA-polymers depends on their sequence content, for instance, occurrences of long homopolymer stretches, palindromes, GC content.

These effects are not universal, but can also depend on factors like concentration, temperature, which enzyme is used, etc.

When looking at RNA-Seq data, should we treat GC content as noise or as bias?

One person’s noise can be another’s bias

We may think that the outcome of tossing a coin is completely random.

If we meticulously registered the initial conditions of the coin flip and solved the mechanical equations, we could predict which side has a higher probability of coming up: noise becomes bias.

Tosser Tosser

We use probabilistic modelling as a method to bring some order into our ignorance – but let’s keep in mind that these models are a reflection of our subjective ignorance, not an objective property of the world.

Biological versus technical replicates

Imagine we want to test whether a weight loss drug works. Which of the following designs is better?

This example shows the difference between technological versus biological replicates.

Some design decisions in HT biology are similar, if more subtle:

Quiz

For reliable variant calling with current sequencing technology, you need ca. \(30\times\) coverage per genome.

In 1000 genomes project, average depth of the data produced was 5.1 for 1,092 individuals. Why was that study design chosen?

What exactly is a technical versus biological replicate?

In fact, the terminology is too simplistic. Error can creep in at many different levels:

We will use the notion of blocks.

If we know about these nuisance factors and have kept track of them, we can (should) include them explicitely as bias terms in our models.

If we did not keep track, we can try to use latent factor models (SVA, PEER, RUV) to identify them from the data.

A lack of units

In the old days, we measured lengths in feet, arms, inches (first joint of an index finger), stones, …

International System of Units: meters, seconds, kilograms (soon), amperes, … use universal physical constants. A meter measured by a lab in Australia using one instrument has the same meaning as a meter measured a year later by a lab in Stanford using a different instrument, or by a space probe orbiting Venus.

Measurements in biology are, unfortunately, rarely that comparable.

Often, absolute values are not reported (these would require units), but only fold changes with regard to some local reference.

Even when absolute values exist (e.g., read counts in an RNA-Seq experiment) they usually do not translate into universal units such as molecules per cell or mole per milliliter.

Regular and catastrophic noise

Regular noise can be modelled by simple probability models such as independent normal distributions, Poissons, or mixtures such as Gamma-Poisson or Laplace.

We can use relatively straightforward methods to take noise into account in our data analyses.

The real world is different: measurements can be completely off scale (a sample swap, a contamination or a software bug), and multiple errors often come together: e.g. a whole microtiter plate went bad, affecting all data measured from it.

Such events are harder to model or even correct for – our best chance to deal with them is data quality assessment, outlier detection and documented removal.

Keeping track: Dailies

A film director will view daily takes, to correct potential lighting, shooting issues before they affect too much footage. It is a good idea not to wait until all the runs of an experiment have been finished before looking at the data.

Intermediate data analyses and visualizations will track eventual unexpected sources of variation in the data and enable you to adjust the protocol.

It is important to be aware of sources of variation as they occur and adjust for them.

Don’t confuse objectivity with stupidity.

Phylochip example

RNAi screen example

Basic principles in the design of experiments

Clever combinations and balancing: a motivating example

Weighing example A pharmacist’s balance weighing analogy (Hotelling (1944) and Mood (1946)).

Experimental design aims to maximize the available resources: capitalize on cancellations and symmetries

Hotelling devised an improved weighing scheme using experimental design.

Given a set of eight objects of unknown weights \(m=(m_1,\ldots,m_8)\).

For our experiment, we create a true \(m\) randomly.

set.seed(0xdada2)
m = sample(seq(2, 16, by = 2), 8)  + round(rnorm(8, 1), 1)
m
## [1] 11.6 11.7 16.4  2.5  4.3 13.8  8.8  7.4

Method 1: Naive method, eight weighings

Suppose we our scale has errors distributed normally with a SD of 0.1.

First, let’s weighs each \(m_i\) individually

We simulate the measurements (incl. individual errors) as follows. We also compute their overall variance as a measure of overall error.

X_1to8 = m + rnorm(length(m), mean = 0, sd = 0.1)
X_1to8
## [1] 11.634871 11.626921 16.373008  2.616853  4.062271 13.814796
## [7]  8.728899  7.501732
errors_1to8 = X_1to8 - m
errors_1to8
## [1]  0.03487057 -0.07307895 -0.02699227  0.11685334 -0.23772946
## [6]  0.01479591 -0.07110144  0.10173205
var(errors_1to8)
## [1] 0.01294371

Method 2: Hotelling’s method, eight weighings

library("survey")
h8 = hadamard(6)
coef8 = 2 * h8 - 1
coef8
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,]    1    1    1    1    1    1    1    1
## [2,]    1   -1    1   -1    1   -1    1   -1
## [3,]    1    1   -1   -1    1    1   -1   -1
## [4,]    1   -1   -1    1    1   -1   -1    1
## [5,]    1    1    1    1   -1   -1   -1   -1
## [6,]    1   -1    1   -1   -1    1   -1    1
## [7,]    1    1   -1   -1   -1   -1    1    1
## [8,]    1   -1   -1    1   -1    1    1   -1

We use the columns of this matrix to guide our new weighing scheme.

The first column says we should put all 8 pieces on one side of the balance and weighs them, call this \(Y[1]\).

The second column says: place 1, 3, 5, 7 on one side of the balance and on the other and evaluate the difference. Call the results \(Y[2]\).

And so on.

Each of the eight weighings has a normal error with sd = 0.1.

Y = m  %*% coef8 + rnorm(length(m), mean = 0, sd = 0.1)
mhat = Y %*% t(coef8) / 8

Now, because in this case we know the true \(m\)s we can compute the errors and their variance:

errors_hotel = as.vector(mhat) - m
errors_hotel
## [1] -0.024792853  0.039493179  0.016281407 -0.065757174  0.011702539
## [6]  0.029188907 -0.006383201 -0.006189086
var(errors_hotel)
## [1] 0.001118038
var(errors_1to8) / var(errors_hotel)
## [1] 11.57717

Experiment by simulation

We saw that the second method had variance almost an order of magnitude smaller than the first. Were we just lucky?

tcoef8 = t(coef8) / 8
errs = replicate(10000, {
  m = sample(seq(2, 16, by = 2), 8) + round(rnorm(8, 1), 1)
  X_1to8 = m + rnorm(length(m), mean = 0, sd = 0.1)
  err_1to8 = X_1to8 - m
  Y = coef8 %*% m + rnorm(length(m), mean = 0, sd = 0.1)
  mhat = tcoef8 %*% Y
  err_hotel = mhat - m
  c(var(err_1to8), var(err_hotel))  
})

library("ggplot2")
library("tibble")
ggplot(tibble(ratio = errs[1,] / errs[2,]), aes(x = log10(ratio))) +
  geom_histogram(bins =  50)

We say that the second scheme is more efficient than the first by a factor of 8 because the errors generated by the measurement have a variance which is 8 times lower.

One factor at a time?

Ibn Sina (Avicenna) Physician Scientist
Avicenna The Physician His Canon of Medicine (1020) lists seven rules of experimental design, including the need for controls and replication, the danger of confounding and the necessity of changing only one factor at a time

This dogma was overthrown in the 20th century by RA Fisher.

Comparing two levels of one factor: healthy or diseased.

grid.arrange(p0,p0effect, ncol=2)

However if we color according to the batches:

grid.arrange(p0,p0batch, ncol=2)

We cannot conclude because we are in the presence of confounding.

An experiment with higher noise levels, same number of points as in the previous study sample sizes (2 x 6) is not enough, but with the same error and a bigger sample: (2 x 60).

grid.arrange(p1,pN1, ncol=2)

The experiment at \(n_1=n_2=6\) is not powerful enough.

t.test(exprst~states,data=batdef)
## 
##  Welch Two Sample t-test
## 
## data:  exprst by states
## t = -0.85641, df = 9.9022, p-value = 0.412
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.0243506  0.4560882
## sample estimates:
## mean in group healthy   mean in group tumor 
##              1.715022              1.999154

With the same effect size and a larger sample size, we have the power to see the difference:

t.test(exprsN~stateN,data=dfN)
## 
##  Welch Two Sample t-test
## 
## data:  exprsN by stateN
## t = -8.6555, df = 117.96, p-value = 2.951e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.5911068 -0.3709906
## sample estimates:
## mean in group healthy   mean in group tumor 
##              1.574949              2.055997

Summarize what we see in these boxplots.

Depends on:

Theory of multifactorial experiments.

Useful language

Decomposition of variability: analysis of variance.

Blocking : the case of paired experiments.

ZeaMays

ZeaMays

Each pot in Darwin’s Zea Mays experiment is a block, only the factor of interest should be different (pollination method), all other factors should be kept equal within a block.

A balanced design is an experimental design where all the different factor combinations have the same number of observation replicates. Such data are particularly easy to analyse because the effect of each factor is identifiable.

When there are (likely) nuisance factors, it is good to make sure they are balanced with the factors of interest. Sometimes this is inconvenient or impractical for logistic or economic reasons – but in such cases analysts are on thin ice and need to proceed with caution.

Comparing paired versus unpaired design

When comparing various possible designs, we do power simulations.

Here, we suppose the sample size is 15 in each group and the effect size is 0.2. We also need to make assumptions about the standard deviations of the measurements, here we suppose both groups have the same sd=0.25.

set.seed(45123)
n=15; effect=0.2
pots=rnorm(15,0,1)
noiseh=rnorm(15,0,0.25)
noisea=rnorm(15,0,0.25)
hybrid=pots+effect+noiseh
autoz=pots+noisea

Perform both a simple t.test and a paired t.test, which is more powerful in this case?

t.test(hybrid,autoz,paired=FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  hybrid and autoz
## t = 0.47985, df = 28, p-value = 0.6351
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.5669088  0.9137679
## sample estimates:
##  mean of x  mean of y 
## 0.24714393 0.07371438
t.test(hybrid,autoz,paired=TRUE)
## 
##  Paired t-test
## 
## data:  hybrid and autoz
## t = 2.1722, df = 14, p-value = 0.04751
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.002186349 0.344672751
## sample estimates:
## mean of the differences 
##               0.1734295

Maybe we were just lucky in our random simulation here.

Check which method is more powerful. Run a parametric bootstrap experiment, generate data as above B=1000 times and compute the average probability of rejection for these 1000 trials, with a false positive rate \(\alpha=0.05\).

B=1000 ; n=15 ;effect=0.2
ppaired=rep(0,B)
pttest=rep(0,B)
for (i in 1 :1000){
pots=rnorm(15,0,1)
noiseh=rnorm(15,0,0.25)
noisea=rnorm(15,0,0.25)
hybrid=pots+effect+noiseh
autoz=pots+noisea
pttest[i]=t.test(hybrid,autoz,paired=FALSE)$p.value
ppaired[i]=t.test(hybrid,autoz,paired=TRUE)$p.value}
sum(pttest<0.05)/B
## [1] 0.001
sum(ppaired<0.05)/B
## [1] 0.526

We can plot the p-values obtained using both methods:

library("ggplot2")
dtp=data.frame(pvalues=c(pttest,ppaired),
    experiment=factor(c(rep("notpaired",B),rep("paired",B))))
m = ggplot(dtp, aes(pvalues, fill=experiment))
m + geom_histogram(binwidth=0.01,alpha=0.3)

Exercises

powercomparison = function(effect=0.2,n=15,B=1000,
                noisesd=0.25,potsd=1){
ppaired=rep(0,B);pttest=rep(0,B)
for (i in 1 :B){
pots=rnorm(n,0,potsd)
noiseh=rnorm(n,0,noisesd)
noisea=rnorm(n,0,noisesd)
hybrid=pots+effect+noiseh
autoz=pots+noisea
pttest[i]=t.test(hybrid,autoz,paired=FALSE)$p.value
ppaired[i]=t.test(hybrid,autoz,paired=TRUE)$p.value
}
PowerPaired=sum(ppaired<0.05)/B
PowerUnpaired=sum(pttest<0.05)/B
powers=cbind(PowerPaired,PowerUnpaired)
return(powers)
}

Here are a few values showing that when the pots sd is smaller than the noise sd, pairing hardly makes a difference. If the pots variability is larger than that of the measurement noise then pairing makes a big difference.

powercomparison(potsd=0.5,noisesd=0.25)
##      PowerPaired PowerUnpaired
## [1,]       0.495         0.028
powercomparison(potsd=0.25,noisesd=0.5)
##      PowerPaired PowerUnpaired
## [1,]       0.168         0.126
powercomparison(potsd=0.25,noisesd=0.1)
##      PowerPaired PowerUnpaired
## [1,]       0.999         0.495
powercomparison(potsd=0.1,noisesd=0.25)
##      PowerPaired PowerUnpaired
## [1,]       0.524         0.482

For 100 plants of each type and the two SDs being 0.5, the power of the paired test is about 80%.

powercomparison(potsd=0.5,noisesd=0.5,n=100)
##      PowerPaired PowerUnpaired
## [1,]       0.788         0.497

take into account a natural pairing of the observations – for instance, twin studies, or studies of patients before and after a treatment. What can be done when pairing is not available.

try to create pairs of subjects that have as much similarity as possible through mathcing age, gender, background health etc. One is treated, the other serves as a control.

“Block what you can, randomize what you cannot”

(George Box, 1978)

Special designs: paired, complete block randomized,

Often we don’t know which nuisance factors will be important, or we cannot plan for them ahead of time.

In such cases, randomization is a practical strategy: at least in the limit of large enough sample size, the effect of any nuisance factor should average out.

Complete Random Block Design (CRB)

Complete randomized

Complete randomized

Balanced incomplete Block Design ()

balanced incomplete

balanced incomplete

Complete factorial Latin Squares

sudoku anyone?

sudoku anyone?

Randomization decreases bias.

Randomization helps inference.

Random does not mean haphazardly:

Controls, positive and negative: why?

We often need to remove variation due to unknown factors, or decompose variability according to its different sources; this is classically done using analysis of variance and mixed models that can accomodate random factors such as subject effects and fixed factors such as batch effects.

Usually these decompositions require at least 3 replicate measurements in each cell.

Removal of effects from unknown sources can only be done through the use of negative controls[^9].

Calibration of the effect size in an experiment also requires the use of positive controls; spike-ins (for instance External RNA Control Consortium controls as used in Risso et al. (2014)) where a known quantity or a known expression level aid in these calibrations and are a standard part of many experimental protocols.

Validation in independent data / independent lines of argument

If it is too good to be true, it usually is? An anecdote [^10]

How many replicates do I need?

We use preliminary data and simulation experiments calculating how many nucleotides were necessary to achieve a 80% true positive rate when we knew the alternative.

Now, recall the discussion of experiments versus studies.

For the cell line experiment, we might get the correct result already from one replicate; usually we’ll do two or three to be sure.

On the other hand, for the study, our intuition tells us that there is so much uncontrolled variability that 20 is likely far too few, and we may need 200 or 2,000 patients for a reliable result. The number of replicates needed is highly context specific. It depends on the amount of uncontrolled [^11] variability, and the effect size. A pragmatic approach is to check out previous successful (or unsuccessful) experiments or studies that did something comparable and use simulations, subsampling or bootstrapping to get an estimate of the prosed study’s power. Here are more details about how to go about this in practice.

Power depends on sample sizes, effect sizes and variability.

Effective sample size for dependent data.

Example: To have a power of 80% to distinguish two groups (\(\alpha=0.05\)) one needs samples of size 64 in each group (128 measurements).\

For two time points that have a correlation of \(\rho=0.6\) one needs a sample of \(N=50\) measured twice (ie 200 measurements ).

Perturbation Design

Perturbation Design

Incomplete blocks, ragged arrays, missing data.

Mean-Variance Relationships and Transformations

Previously we saw examples for data transformations whose that compress or stretch the space of quantitative measurements in such a way that the measurements’ variance is more similar throughout. Thus the variance is no loner highly dependent on the mean value[^14].

The mean-variance relationship of a measurement technology can in principle be any function, but in many cases, the following prototyic relationships hold:

  1. constant: the variance is independent of the mean.

  2. Poisson: the variance is proportional to to the mean.

  3. quadratic: the standard deviation is proportional to the mean, therefore the variance grows quadratically.

Give examples for biological assays (or measurement technologies) whose data show these types of relationships.

Real data can also be affected by a combination of these. For instance, with DNA microarrays, the fluorescence intensities are subject to a combination of background noise that is largely independent of the signal, and multiplicative noise whose standard deviation is proportional to the signal (Rocke and Durbin 2001). Therefore, for bright spots the multiplicative noise dominates, whereas for faint ones, the background.

Load up the raw data from a microarray experiment with replicates and verify the above statement.

We recall from Chapter ((???)) that for data with a mean-variance relationship \(v(\mu)\) the variance-stabilizing transformation \(g:\mathbb{R}\to\mathbb{R}\) fulfills the condition \[g'(x)=\frac{1}{\sqrt{v(x)}}\]

a) What are the variance-stabilizing transformations associated with the above three prototypic mean-variance relationships?
b) What is the variance stabilizing transformtaion which is appropriate for \(v(\mu) = v_0 + c\,\mu^2\), where \(v_0>0\) is a positive constant?

Data quality assessment and quality control

Data quality assessment (QA) and quality control (QC, i.e., the removal of insufficiently good data) are essential steps of any data analysis. Being critical of data quality (both of raw and derived data) should pervade all phases of analysis, from data import over model fitting, hypothesis testing to interpretation. The most useful tool for QA is usually visualisations. If they highlight anomalies, it’s necessary to decide which remedial steps to take (for instance, to exclude certain parts of the data), and, probably, to redo the analysis, either with the same or a refined method. These decisions are necessarily subjective and context-dependent.

It is helpful to be clear on what one means by quality, as the word can have many meanings. The most pertinent for us is fitness for purpose[^15]. Back to the example of RNA-Seq data, the purpose of the experiment is, in many cases, the detection of differentially expressed genes between different biological conditions. The aim of QA/QC will then be the identification of data points or groups of data points (e.g., all data from one sample, all data from one gene) that apparently suffered from an anomaly that makes them detrimental to this purpose. Ordination plots and heatmaps are useful to identify outliers: for instance, samples that are not behaving as expected, because of a sample swap or misannotation; or genes that were not measured properly.

Longitudinal Data

Longitudinal data[^16] have time as a covariate. The first question is whether we are looking at a handful of time points – say, the response of a cell line measured 48h, 72h and 84h after exposure to a drug; or a long and densely sampled time series such as patch clamp data in electrophysiology or a movie from life cell microscopy.

In the first case, time is usually best thought of as just another experimental factor, in the same way as we consider the concentration or the choice of drug. One analysis strategy could be to first identify the “best”, or biologically most indicative, time point, and then focus on that. Or we can ask whether there is any effect at all, regardless of the time. We then just need to make sure that we account for the dependencies between the measurements over time, and effective sample sizes. When designing the experiment, we’ll also try to sample more densely at those times when we expect most to happen.

In the second case, time series, we’ll often want to fit dynamical models to the data. We have many choices:

Don’t Pretend You Are Dumb

There is some attraction to seemingly “unbiased” approaches that analyse the data at hand without any reference to what is already known. Such tendencies are reinforced by the fact that statistical methods have often been developed to be generic, for instance, working of a general matrix without specific reference to what the rows and column might signify.

Generic approaches are a good way to get started, and for analyses that are highly powered and straightforward, such an an approach might work out. But often, it is wasteful. Recall the example of an RNA-Seq experiment for differential expression. As we saw in Chapters @(Chap14) and @(Chap7), we could perform a hypothesis test for differential expression for each annotated gene, regardless of its read counts, and then run a multiple testing method that treats all tests as exchangeable. But this is inefficient - we can improve our detection power by filtering out, or downweighting, the tests with lower signal-to-noise ratio.

Other examples include:

Sharpen Your Tools: Reproducible Research

Analysis projects often begin with a simple script, perhaps to try out a few initial ideas or explore the quality of the pilot data. Then more ideas are added, more data come in, other datasets are integrated, more people become involved. Eventually the paper needs to be written, figures be redone properly, and the analysis be saved for the scientific record and to document its integrity. Here are a few tools that can help with such a process.

Use an integrated development environment. RStudio is a great choice; there are also other platforms such as Emacs or Eclipse.

Use literate programming tools such as Rmarkdown or Jupyter. This is more readable (for yourself and for others) than burying explanations and usage instructions in comments in the source code or in separate README files, in addition you can directly embed figures and tables in these documents. Such documents are often good starting points for the supplementary material of your paper.

Anticipate re-engineering of the data formats and the software. The first version of how you choose to represent the data and structure the analysis workflows will rarely be the best. Don’t be afraid[^18] to make a clean cut and redesign them as soon as you notice that you are doing a lot of awkward data manipulations or repetitive steps. This is time well-invested. Sometimes it also helps to unearth bugs.

Reuse existing tools. Don’t reinvent the wheel and rather spend your time on things that are actually new. Before implementing a ‘heuristic’ or a temporary hack that analyses your data in a certain way, spend a couple of minutes researching to see if something like this hasn’t been done before. More often than not, it has, and there is a clean, scalable, and already tested solution.

Use version control, such as git. This takes some time to learn, but in the long run will be infinitely better than all your self-grown attempts at managing evolving code with version numbers, switches and the like. Moreover, this is also the sanest option for collaborative work on code, and it provides an extra backup of your codebase, especially if the server is distinct from your workplace machine.

Use functions rather than copy-pasting stretches of code.

Use the R package system. Soon you’ll note recurring function or variable definitions that you want to share between your individual scripts. It is fine to use the R function to manage them initially, but it is never to early to move them into your own package at the latest when you find yourself starting to write README files or long emails explaining others how to use some script or another. Assembling existing code into an R package is not hard by any means, and offers you many goodies including standardized and convenient ways to provide documentation, to show code usage examples, to test the correct functioning of your code, and to version it. Quite likely you’ll soon appreciate the benefit of using namespaces.

Centralize the location of the raw data files and streamline the derivation of intermediate data. Store the input data at a centralized file server that is professionally backed up. Mark the files as read-only. Have a clear and linear workflow for computing the derived data (e.g. normalised, summarised, transformed etc.) from the raw files, and store these in a separate directory. Anticipate that this workflow will need to be re-run several times, and version it. Use or similar tools[^19] to mirror these files on your personal computer.

Integration. When developing downstream analysis ideas that bring together several different data types, you don’t want to do the conversion from data type specific formats to the representations that machine learning or generic statistical methods use each time on an ad hoc basis. Have a recipe script that assembles the different ingredients and cooks them up as an easily consumable[^20] matrix, data frame or Bioconductor .
Keep a hyperlinked webpage with an index of all analyses. This is helpful for collaborators (especially if the page and the analysis can be accessed via a web browser) and also a good starting point for the methods part of your paper. Structure it in chronological or logical order, or a combination of both.

Data representation

Combining all the data so it is ready for analysis or visualisation often involves a lot of shuffling around of the data, until they are in the right shape and format for an analytical algorithm or a graphics routine.

Errors can occur, lost labels, lost information: be safe, redundancy is good.

Wide vs long table format

Recall Hiiragi data (for space reasons we print only the first five columns):

    ##             1 E3.25  2 E3.25  3 E3.25  4 E3.25  5 E3.25
    ## 1420085_at 3.027715 9.293016 2.940142 9.715243 8.924228
    ## 1418863_at 4.843137 5.530016 4.418059 5.982314 4.923580
    ## 1425463_at 5.500618 6.160900 4.584961 4.753439 4.629728
    ## 1416967_at 1.731217 9.697038 4.161240 9.540123 8.705340

This dataframe has several columns of data, one for each sample (annotated by the column names). Its rows correspond to the four probes, annotated by the row names. This is an example for a data table in wide format.

    ##   variable    value
    ## 1  1 E3.25 3.027715
    ## 2  1 E3.25 4.843137
    ## 3  1 E3.25 5.500618
    ## 4  1 E3.25 1.731217
    ## 5  2 E3.25 9.293016
    ## 6  2 E3.25 5.530016

In the resulting dataframe , each row corresponds to exactly one measured value, stored in the column . Then there are additional columns variable and value, which store the associated covariates.

Compare this to the dataframe above.

Now suppose we want to store somewhere not only the probe identifiers but also the associated gene symbols. We could stick them as an additional column into the wide format dataframe, and perhaps also throw in the genes’ ENSEMBL identifier for good measure. But now we immediately see the problem: the dataframe now has some columns that represent different samples, and others that refer to information for all samples (the gene symbol and identifier) and we somehow have to know this when interpreting the dataframe. This is what Hadley Wickham calls untidy data[^21]. In contrast, in the tidy dataframe , we can add these columns, yet still know that each row forms exactly one observation, and all information associated with that observation is in the same row.

Ragged arrays

In tidy data (???),

  1. each variable forms a column,

  2. each observation forms a row,

  3. each type of observational unit forms a table.

A potential drawback is efficiency: even though there are only 4 probe-gene symbol relationships, we are now storing them 404 times in the rows of the dataframe . Moreover, there is no standardisation: we chose to call this column , but the next person might call it or even something completely different, and when we find a dataframe that was made by someone else and that contains a column , we can hope, but have no guarantee, that these are valid gene symbols. Addressing such issues is behind the object-oriented design of the specialized data structures in Bioconductor, such as the class.

Matrices versus dataframes

For a specific data type, it may not always be the most efficient way of storing data, and it cannot easily transport rich metadata (i.e., data about the data).[^22] For instance, our example dataset is stored as an object in Bioconductor’s class SummarizedExperiment which has multiple components, most importantly, the matrix with 45101 rows and 101 columns. The matrix elements are the gene expression measurements, and the feature and sample associated with each measurement are implied by its position (row, column) in the matrix; in contrast, in the long table format, the feature and sample identifiers need to be stored explicitly with each measurement. Besides, has additional components, including the dataframes and , which provide various sets metadata about the microarray features and the phenotypic information about the samples.

Analysis Workflow Design

Don’t immediately rush into downstream (‘biological’) analysis, or run black-box algorithms. First make sure you understand the qualities of the data. Here are some questions and diagnostic plots:

Many examples of expression experiments with unfortunate designs exist. S. Lin et al. (2014) had an experimental design where the human and mouse samples were treated differently (different lanes, time to collect samples etc..) so that their confounding[^23] of batch factor and species factor precluded the verification of a difference between species (Gilad and Mizrahi-Man 2015).

Leaky pipelines and statistical sufficiency

Data analysis pipelines in high-throughput biology often work as funnels that successively summarise and compress the data. In high-throughput sequencing, we may start with individual sequencing reads, then align them to a reference, then only count the aligned reads for each position, summarise positions to genes (or other kinds of regions), then ???normalize??? these numbers by library size to make them comparable across libraries, etc. At each step, we loose some information, and it is important to make sure we still have enough information for the task at hand[^24]. The problem is particularly burning if we use a data pipeline built from a series of separate components without enough care being taken ensuring that all the information necessary for ulterior steps is conserved.

Statisticians have a concept for whether certain summaries enable the reconstruction of all the relevant information in the data: sufficiency. In a Bernoulli random experiment with a known number of trials, \(n\), the number of successes is a sufficient statistic for estimating the probability of success \(p\).

In a 4 state Markov chain (A,C,G,T) such as the one we saw in Chapter??@(Chap10), what are the sufficient statistics for the estimation of the transition probabilities?

Iterative approaches akin to what we saw when we used the EM algorithm can sometimes help avoid information loss. For instance, when analyzing mass spectroscopy data, a first run guesses at peaks individually for every sample. After this preliminary spectra-spotting, another iteration allows us to borrow strength from the other samples to spot spectra that may have been labeled as noise.

Parametric bootstrap for power simulations Power calculations need to be done before an experiment is run. However, when calculating power one needs to know estimates of quantities such as the effect size[^26] (the difference in means between two treatments) that will only be known after the experiment. The best way to work around this is to simulate with different possible values of the unknown parameters to get an idea of the actual sample sizes needed.

Generate data with:

To do this, we need to find a parametric family \(F_\theta\) that is not too terrible in modeling our data. Then we estimate \(\theta\) from the data generate whole sets of simulated data from the distribution \(F_{\hat{\theta}}\), while varying effect sizes, sample sizes, levels of replication, etc.

Summary

Issues should be carefully considered before doing an experiment.

There are two types of variation in an experiment: one is of interest, the other is unwanted.

We usually cannot rid ourselves of all the unwanted variation but we saw how we can used balanced randomized designs, data transformations that improve quality control.

We showed how to compute the power of our study by simulation experiments, these illustrated the necessity to carefully choose time points in longitudinal studies especially when perturbations are part of the design.

A reproducible workflow enables us to be transparent in our choices and enable us to evaluate the overall robustness of the results.

Technology specific issues

RNA-seq type experiments, metagenomics, chipseq

Read Coverage Depth Choice depends on the context and goal of the experiment. Sims et al. (2014) make some very important points about choices that have to be made before the experiment is carried out. With a finite number of runs and samples possible because of resource limitations, one had to …

Different problems require different depths

Mass spec, proteomics

See careful study by Oberg and Vitek (2009)

References and Further Reading

This lecture presented merely a pragmatic introduction to design,
there are many book long treatments of the subject that offer
detailed advice on setting up
experiments 

Wu and Hamada (2011); Box, Hunter, and Hunter (1978); Glass (2007)

Auer, Paul L, and RW Doerge. 2010. “Statistical Design and Analysis of RNA Sequencing Data.” Genetics 185 (2). Genetics Soc America: 405–16.

Box, George EP, William G Hunter, and J Stuart Hunter. 1978. Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building. John Wiley & Sons.

Fisher, Ronald Aylmer. 1935. The Design of Experiments. Oliver & Boyd.

Gilad, Yoav, and Orna Mizrahi-Man. 2015. “A Reanalysis of Mouse Encode Comparative Gene Expression Data.” F1000Research 4. Faculty of 1000 Ltd.

Glass, David J. 2007. Experimental Design for Biologists. Cold Spring Harbor Laboratory Press.

Hotelling, Harold. 1944. “Some Improvements in Weighing and Other Experimental Techniques.” The Annals of Mathematical Statistics 15 (3). JSTOR: 297–306.

Lin, Shin, Yiing Lin, Joseph R Nery, Mark A Urich, Alessandra Breschi, Carrie A Davis, Alexander Dobin, et al. 2014. “Comparison of the Transcriptional Landscapes Between Human and Mouse Tissues.” PNAS 111 (48). National Acad Sciences: 17224–9.

Mood, Alexander M. 1946. “On Hotelling’s Weighing Problem.” The Annals of Mathematical Statistics. JSTOR, 432–46.

Oberg, Ann L, and Olga Vitek. 2009. “Statistical Design of Quantitative Mass Spectrometry-Based Proteomic Experiments.” Journal of Proteome Research 8 (5). ACS Publications: 2144–56.

Risso, Davide, John Ngai, Terence P Speed, and Sandrine Dudoit. 2014. “Normalization of Rna-Seq Data Using Factor Analysis of Control Genes or Samples.” Nature Biotechnology 32 (9). Nature Research: 896–902.

Rocke, David M, and Blythe Durbin. 2001. “A Model for Measurement Error for Gene Expression Arrays.” Journal of Computational Biology 8 (6): 557–69.

Sims, David, Ian Sudbery, Nicholas E Ilott, Andreas Heger, and Chris P Ponting. 2014. “Sequencing Depth and Coverage: Key Considerations in Genomic Analyses.” Nature Reviews Genetics 15 (2). Nature Publishing Group: 121–32.

Wu, CF Jeff, and Michael S Hamada. 2011. Experiments: Planning, Analysis, and Optimization. Vol. 552. John Wiley & Sons.